
[Figure 2.4 diagram: Input → Patch Embedding → L stacked Transformer blocks (MHSA and MLP, each followed by Add & Norm) → Classifier; within MHSA, query/key/value matrix multiplications produce the attention score, the Information Rectification Module (IRM) acts on the quantized attention computation, and teacher activations feed the Distribution Guided Distillation (DGD).]

FIGURE 2.4
Overview of Q-ViT, applying the Information Rectification Module (IRM) for maximizing representation information and Distribution Guided Distillation (DGD) for accurate optimization.

inevitably deteriorates the attention module's representation capability in capturing the input's global dependency. Second, the distillation for the fully quantized ViT baseline utilizes a distillation token (following [224]) to directly supervise the quantized ViT classification output. However, we found that such a simple supervision is not effective enough: it is coarse-grained because of the large gap between the quantized attention scores and their full-precision counterparts.

To address the issues above, a fully quantized ViT (Q-ViT) [136] is developed by keeping the distribution of the quantized attention modules consistent with that of their full-precision counterparts (see the overview in Fig. 2.4). Accordingly, we propose to rectify the distorted distribution over the quantized attention modules through an Information Rectification Module (IRM) based on information entropy maximization in the forward process. In the backward process, we present a distribution-guided distillation (DGD) scheme to eliminate the distribution variation through an attention similarity loss between the quantized ViT and its full-precision counterpart.
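To make the distillation idea concrete before the method is detailed, the following is a minimal PyTorch sketch of an attention-similarity term between a quantized student and its full-precision teacher. It is only an illustration under simplifying assumptions (a plain MSE on the attention maps, with an assumed tensor shape and function name); it is not the exact DGD loss of Q-ViT [136].

```python
import torch
import torch.nn.functional as F

def attention_similarity_loss(attn_q: torch.Tensor, attn_fp: torch.Tensor) -> torch.Tensor:
    """Illustrative attention-distillation term.

    attn_q  -- attention maps of the quantized student, shape (batch, heads, tokens, tokens)
    attn_fp -- attention maps of the full-precision teacher, same shape

    A simple MSE stand-in for the distribution-guided loss described in the text.
    """
    # Detach the teacher so gradients flow only into the quantized student.
    return F.mse_loss(attn_q, attn_fp.detach())
```

In a training loop, such a term would typically be weighted and added to the classification loss of the quantized model.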

2.3.1 Baseline of Fully Quantized ViT

First, we build a baseline to study the fully quantized ViT, since it has never been proposed in previous work. A straightforward solution is to quantize the representations (weights and activations) of the ViT architecture in the forward propagation and to apply distillation to the optimization in the backward propagation.

Quantized ViT architecture. We briefly review neural network quantization. A general asymmetric activation quantization and symmetric weight quantization scheme is formulated as

$$
\begin{aligned}
Q_a(x) &= \big\lfloor \mathrm{clip}\{(x - z)/\alpha_x,\; Q^x_n,\; Q^x_p\} \big\rceil, \qquad & \hat{x} &= Q_a(x) \times \alpha_x + z,\\
Q_w(w) &= \big\lfloor \mathrm{clip}\{w/\alpha_w,\; Q^w_n,\; Q^w_p\} \big\rceil, \qquad & \hat{w} &= Q_w(w) \times \alpha_w.
\end{aligned}
\tag{2.13}
$$

Here, $\mathrm{clip}\{y, r_1, r_2\}$ returns $y$ with values below $r_1$ set as $r_1$ and values above $r_2$ set as $r_2$, and $\lfloor y \rceil$ rounds $y$ to the nearest integer. With activations quantized to signed $a$ bits and weights to signed $b$ bits, $Q^x_n = -2^{a-1}$, $Q^x_p = 2^{a-1} - 1$ and $Q^w_n = -2^{b-1}$, $Q^w_p = 2^{b-1} - 1$.
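As a concrete illustration of Eq. (2.13), a minimal PyTorch sketch of the two quantizers is given below; the function names, the default bit-widths, and the assumption that the scales $\alpha_x$, $\alpha_w$ and zero-point $z$ are provided externally are ours, not part of the original formulation.

```python
import torch

def quantize_activation(x: torch.Tensor, alpha_x: float, z: float, a: int = 4) -> torch.Tensor:
    """Asymmetric activation quantizer of Eq. (2.13).

    alpha_x (scale) and z (zero-point) are assumed to be given,
    e.g. learned or calibrated; the bit-width default is arbitrary.
    """
    qn, qp = -(2 ** (a - 1)), 2 ** (a - 1) - 1                # Q_n^x, Q_p^x for signed a bits
    q = torch.round(torch.clamp((x - z) / alpha_x, qn, qp))   # ⌊clip{(x - z)/α_x, Q_n^x, Q_p^x}⌉
    return q * alpha_x + z                                    # de-quantized x̂

def quantize_weight(w: torch.Tensor, alpha_w: float, b: int = 4) -> torch.Tensor:
    """Symmetric weight quantizer of Eq. (2.13); alpha_w (scale) is assumed given."""
    qn, qp = -(2 ** (b - 1)), 2 ** (b - 1) - 1                # Q_n^w, Q_p^w for signed b bits
    q = torch.round(torch.clamp(w / alpha_w, qn, qp))         # ⌊clip{w/α_w, Q_n^w, Q_p^w}⌉
    return q * alpha_w                                        # de-quantized ŵ
```

During training, the non-differentiable rounding is usually handled with a straight-through estimator, which relates to the forward and backward propagation discussed next.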

In general, the forward and backward propagation of the quantization function in the quantized